
The present invention relates to the following U.S. patent
applications, the subject matter of which is hereby
incorporated by reference: (1) U.S. application entitled
Thread Switch Control in a Multithreaded Processor System,
Serial Number 08/957,005 filed 23 October 1997 concurrently
herewith; (2) U.S. application entitled An Apparatus and Method
to Guarantee Forward Progress in a Multithreaded Processor,
Serial Number 08/950,875 filed 23 October 1997 concurrently
herewith; (3) U.S. application entitled Altering Thread
Priorities in a Multithreaded Processor, Serial Number
08/958,718, filed 23 October 1997 concurrently herewith; (4)
U.S. application entitled Method and Apparatus to Force a
Thread Switch in a Multithreaded Processor, Serial Number
08/956,577 filed 23 October 1997 concurrently herewith; (5)
U.S. application entitled Background Completion of Instruction
and Associated Fetch Request in a Multithread Processor, Serial
Number 773,572 filed 27 December 1996; (6) U.S. application
entitled Multi-Entry Fully Associative Transition Cache, Serial
Number 761,378 filed 09 December 1996; (7) U.S. application
entitled Method and Apparatus for Prioritizing and Routing
Commands from a Command Source to a Command Sink, Serial Number
751,380 filed 09 December 1996; (8) U.S. application entitled
Method and Apparatus for Tracking Processing of a Command,
Serial Number 761,379 filed 09 December 1996; (9) U.S.
application entitled Method and System for Enhanced Multithread
Operation in a Data Processing System by Reducing Memory Access
Latency Delays, Serial Number 473,692 filed 7 June 1995; and
(10) U.S. Patent 5,778,243 entitled Multithreaded Cell for a
Memory, issued 07 July 1998.

The present invention relates in general to an improved
method for and apparatus of a computer data processing system;
and in particular, to an improved high performance
multithreaded computer data processing system and method
embodied in the hardware of the processor.

The fundamental structure of a modern computer includes
peripheral devices to communicate information to and from the
outside world; such peripheral devices may be keyboards,
monitors, tape drives, communication lines coupled to a
network, etc. Also included in the basic structure of the
computer is the hardware necessary to receive, process, and
deliver this information from and to the outside world,
including busses, memory units, input/output (I/O) controllers,
storage devices, and at least one central processing unit
(CPU), etc. The CPU is the brain of the system. It executes
the instructions which comprise a computer program and directs
the operation of the other system components.

From the standpoint of the computer's hardware, most
systems operate in fundamentally the same manner. Processors
actually perform very simple operations quickly, such as
arithmetic, logical comparisons, and movement of data from one
location to another. Programs which direct a computer to
perform massive numbers of these simple operations give the
illusion that the computer is doing something sophisticated.
What is perceived by the user as a new or improved capability
of a computer system, however, may actually be the machine
performing the same simple operations, but much faster.
Therefore continuing improvements to computer systems require
that these systems be made ever faster.

One measurement of the overall speed of a computer system,
also called the throughput, is measured as the number of
operations performed per unit of time. Conceptually, the
simplest of all possible improvements to system speed is to
increase the clock speeds of the various components,
particularly the clock speed of the processor, so that if
everything runs twice as fast but otherwise works in exactly
the same manner, the system will perform a given task in half
the time. Computer processors which were constructed from
discrete components years ago performed significantly faster
by shrinking the size and reducing the number of components;
eventually the entire processor was packaged as an integrated
circuit on a single chip. The reduced size made it possible
to increase the clock speed of the processor, and accordingly
increase system speed.

Despite the enormous improvement in speed obtained from
integrated circuitry, the demand for ever faster computer
systems still exists. Hardware designers have been able to
obtain still further improvements in speed by greater
integration, by further reducing the size of the circuits, and
by other techniques. Designers, however, think that physical
size reductions cannot continue indefinitely and there are
limits to continually increasing processor clock speeds.
Attention has therefore been directed to other approaches for
further improvements in overall speed of the computer system.

Without changing the clock speed, it is still possible to
improve system speed by using multiple processors. The modest
cost of individual processors packaged on integrated circuit
chips has made this practical. The use of slave processors
considerably improves system speed by off-loading work from the
CPU to the slave processor. For instance, slave processors
routinely execute repetitive and single special purpose
programs, such as input/output device communications and
control. It is also possible for multiple CPUs to be placed
in a single computer system, typically a host-based system
which services multiple users simultaneously. Each of the
different CPUs can separately execute a different task on
behalf of a different user, thus increasing the overall speed
of the system to execute multiple tasks simultaneously. It is
much more difficult, however, to improve the speed at which a
single task, such as an application program, executes.

Coordinating the execution and delivery of results of various
functions among multiple CPUs is a tricky business. For slave
I/O processors this is not so difficult because the functions
are pre-defined and limited, but for multiple CPUs executing
general purpose application programs it is much more difficult
to coordinate functions because, in part, system designers do
not know the details of the programs in advance. Most
application programs follow a single path or flow of steps
performed by the processor. While it is sometimes possible to
break up this single path into multiple parallel paths, a
universal application for doing so is still being researched.
Generally, breaking a lengthy task into smaller tasks for
parallel processing by multiple processors is done by a
software engineer writing code on a case-by-case basis. This
ad hoc approach is especially problematic for executing
commercial transactions which are not necessarily repetitive
or predictable.

Thus, while multiple processors improve overall system
performance, there are still many reasons to improve the speed
of the individual CPU. If the CPU clock speed is given, it is
possible to further increase the speed of the CPU, i.e., the
number of operations executed per second, by increasing the
average number of operations executed per clock cycle. A
common architecture for high performance, single-chip
microprocessors is the reduced instruction set computer (RISC)
architecture characterised by a small simplified set of
frequently used instructions for rapid execution, those simple
operations performed quickly mentioned earlier. As
semiconductor technology has advanced, the goal of RISC
architecture has been to develop processors capable of
executing one or more instructions on each clock cycle of the
machine. Another approach to increase the average number of
operations executed per clock cycle is to modify the hardware
within the CPU. This throughput measure, clock cycles per
instruction, is commonly used to characterise architectures for
high performance processors. Instruction pipelining and cache
memories are computer architectural features that have made
this achievement possible. Pipeline instruction execution
allows subsequent instructions to begin execution before
previously issued instructions have finished. Cache memories
store frequently used and other data nearer the processor and
allow instruction execution to continue, in most cases, without
waiting the full access time of a main memory. Some
improvement has also been demonstrated with multiple execution
units with look ahead hardware for finding instructions to
execute in parallel.

The performance of a conventional RISC processor can be
further increased in the superscalar computer and the Very Long
Instruction Word (VLIW) computer, both of which execute more
than one instruction in parallel per processor cycle. In these
architectures, multiple functional or execution units are
provided to run multiple pipelines in parallel. In a
superscalar architecture, instructions may be completed
in-order and out-of-order. In-order completion means no
instruction can complete before all instructions dispatched
ahead of it have been completed. Out-of-order completion means
that an instruction is allowed to complete before all
instructions ahead of it have been completed, as long as
predefined rules are satisfied.

For both in-order and out-of-order execution in
superscalar systems, pipelines will stall under certain
circumstances. An instruction that is dependent upon the
results of a previously dispatched instruction that has not yet
completed may cause the pipeline to stall. For instance,
instructions dependent on a load/store instruction in which the
necessary data is not in the cache, i.e., a cache miss, cannot
be executed until the data becomes available in the cache.
Maintaining the requisite data in the cache necessary for
continued execution and to sustain a high hit ratio, i.e., the
number of requests for data compared to the number of times the
data was readily available in the cache, is not trivial
especially for computations involving large data structures.
A cache miss can cause the pipelines to stall for several
cycles, and the total amount of memory latency will be severe
if the data is not available most of the time. Although memory
devices used for main memory are becoming faster, the speed gap
between such memory chips and high-end processors is becoming
increasingly larger. Accordingly, a significant amount of
execution time in current high-end processor designs is spent
waiting for resolution of cache misses and these memory access
delays use an increasing proportion of processor execution
time.
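
The effect of the hit ratio on memory latency can be illustrated
with a small calculation. The sketch below is only an
illustration under stated assumptions: it uses the conventional
definition of hit ratio (hits divided by requests), a single
cache level, and made-up cycle counts. The function name
average_access_cycles and all of the numbers are hypothetical
and do not come from this specification.

```python
def average_access_cycles(hits, requests, hit_cycles, miss_penalty_cycles):
    """Average cycles per data request for a single cache level.

    Assumes every request costs hit_cycles, and every miss costs an
    additional miss_penalty_cycles while the pipeline waits.
    """
    if requests <= 0:
        raise ValueError("requests must be positive")
    hit_ratio = hits / requests
    miss_ratio = 1.0 - hit_ratio
    return hit_cycles + miss_ratio * miss_penalty_cycles


# Hypothetical workloads: same cache, different hit ratios.
high_hit = average_access_cycles(hits=980, requests=1000,
                                 hit_cycles=1, miss_penalty_cycles=50)
low_hit = average_access_cycles(hits=800, requests=1000,
                                hit_cycles=1, miss_penalty_cycles=50)
print(f"98% hit ratio: {high_hit:.2f} cycles per request")
print(f"80% hit ratio: {low_hit:.2f} cycles per request")
```

Under these assumed numbers, a modest drop in hit ratio
multiplies the average access cost several times over, which is
the latency that the thread switching described in this
specification is meant to tolerate.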

And yet another technique to improve the efficiency of
hardware within the CPU is to divide a processing task into
independently executable sequences of instructions called
threads. This technique is related to breaking a larger task
into smaller tasks for independent execution by different
processors except here, the threads are to be executed by the
same processor. When a CPU then, for any of a number of
reasons, cannot continue the processing or execution of one of
these threads, the CPU switches to and executes another thread.
This is the subject of the invention described herein which
incorporates hardware multithreading to tolerate memory
latency. The term "multithreading" as defined in the computer
architecture community is not the same as the software use of
the term which means one task subdivided into multiple related
threads. In the architecture definition, the threads may be
independent. Therefore "hardware multithreading" is often used
to distinguish the two uses of the term. The present invention
incorporates the term multithreading to connote hardware
multithreading.

Multithreading permits the processor's pipeline(s) to do
useful work on different threads when a pipeline stall
condition is detected for the current thread. Multithreading
also permits processors implementing non-pipeline architectures
to do useful work for a separate thread when a stall condition
is detected for a current thread. There are two basic forms
of multithreading. A traditional form is to keep N threads,
or states, in the processor and interleave the threads on a
cycle-by-cycle basis. This eliminates all pipeline
dependencies because instructions in a single thread are
separated. The other form of multithreading, and the one
considered by the present invention, is to interleave the
threads on some long-latency event.
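
The second form, switching on a long-latency event, can be
sketched with a toy cycle counter. This is a simplified model
and not the hardware described in this specification: each
thread is a list of integers in which 0 stands for an ordinary
one-cycle instruction and a positive value stands for an
instruction that misses and waits that many further cycles. The
function names, the one-cycle switch penalty, and the example
workloads are all assumptions made for illustration.

```python
def stall_in_place_cycles(threads):
    """Cycles needed when every miss stalls the pipeline until it resolves."""
    # One cycle per instruction plus the full latency of each miss.
    return sum(len(ops) + sum(ops) for ops in threads)


def switch_on_event_cycles(threads, switch_penalty=1):
    """Cycles needed when the processor switches threads on a miss."""
    n = len(threads)
    next_op = [0] * n      # index of the next instruction in each thread
    ready_at = [0] * n     # cycle at which each thread's last miss resolves
    cycle = 0
    active = 0

    def unfinished(t):
        return next_op[t] < len(threads[t])

    while any(unfinished(t) for t in range(n)):
        runnable = [t for t in range(n)
                    if unfinished(t) and ready_at[t] <= cycle]
        if not runnable:
            # Every remaining thread is waiting on memory: idle until
            # the earliest outstanding miss resolves.
            cycle = min(ready_at[t] for t in range(n) if unfinished(t))
            continue
        if active not in runnable:
            # The active thread is blocked or finished: switch threads.
            active = runnable[0]
            cycle += switch_penalty
        latency = threads[active][next_op[active]]
        next_op[active] += 1
        cycle += 1
        if latency:
            ready_at[active] = cycle + latency
    # A trailing miss must still resolve before the work is complete.
    return max([cycle] + ready_at)


# Two hypothetical threads, each with one 10-cycle miss in the middle.
work = [[0, 0, 10, 0, 0], [0, 0, 10, 0, 0]]
baseline = stall_in_place_cycles(work)
overlapped = switch_on_event_cycles(work, switch_penalty=1)
print(baseline, overlapped)
```

In this model the second thread's instructions execute while the
first thread's miss is outstanding, so the switching version
finishes in fewer cycles than the stall-in-place baseline for
this workload.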



Traditional forms of multithreading involve replicating
the processor registers for each thread. For instance, for a
processor implementing the architecture sold under the trade
name PowerPC™ to perform multithreading, the processor must
maintain N states to run N threads. Accordingly, the following
are replicated N times: general purpose registers, floating
point registers, condition registers, floating point status and
control register, count register, link register, exception
register, save/restore registers, and special purpose
registers. Additionally, the special buffers, such as a
segment lookaside buffer, can be replicated or each entry can
be tagged with the thread number and, if not, must be flushed
on every thread switch. Also, some branch prediction
mechanisms, e.g., the correlation register and the return
stack, should also be replicated. Fortunately, there is no
need to replicate some of the larger functions of the processor
such as: level one instruction cache (L1 I-cache), level one
data cache (L1 D-cache), instruction buffer, store queue,
instruction dispatcher, functional or execution units,
pipelines, translation lookaside buffer (TLB), and branch
history table. When one thread encounters a delay, the
processor rapidly switches to another thread. The execution
of this thread overlaps with the memory delay on the first
thread.
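
The split between replicated and shared state can be pictured
as a data structure. The sketch below is a software analogy
only: the class names, the field names, and the register counts
(32 general purpose and 32 floating point registers) are
assumptions chosen for illustration, and only a few of the
shared structures listed above are modelled.

```python
from dataclasses import dataclass, field


@dataclass
class ThreadState:
    """Architected state that is replicated once per hardware thread."""
    gprs: list = field(default_factory=lambda: [0] * 32)    # general purpose
    fprs: list = field(default_factory=lambda: [0.0] * 32)  # floating point
    condition_register: int = 0
    fp_status_control: int = 0
    count_register: int = 0
    link_register: int = 0
    exception_register: int = 0
    save_restore: list = field(default_factory=lambda: [0, 0])
    special_purpose: dict = field(default_factory=dict)


@dataclass
class SharedResources:
    """Larger structures that exist once and serve every thread."""
    l1_icache: dict = field(default_factory=dict)
    l1_dcache: dict = field(default_factory=dict)
    tlb: dict = field(default_factory=dict)
    branch_history: dict = field(default_factory=dict)


class MultithreadedCore:
    def __init__(self, n_threads):
        self.states = [ThreadState() for _ in range(n_threads)]  # N copies
        self.shared = SharedResources()                          # one copy
        self.active = 0

    def switch_to(self, thread_id):
        # A switch only selects a different replicated state; nothing
        # is saved to or restored from memory.
        self.active = thread_id

    @property
    def regs(self):
        return self.states[self.active]


core = MultithreadedCore(n_threads=2)
core.regs.gprs[3] = 42                  # thread 0 writes its own GPR 3
core.switch_to(1)
core.regs.gprs[3] = 7                   # thread 1 has a separate GPR 3
core.shared.l1_dcache[0x100] = "line"   # the cache is common to both
```

Because each thread keeps its own copy of the registers, a
switch in this model is just a change of index, which is why the
switch can be rapid.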

Existing multithreading techniques describe switching
threads on a cache miss or a memory reference. A primary
example of this technique may be reviewed in "Sparcle: An
Evolutionary Design for Large-Scale Multiprocessors," by
Agarwal et al., IEEE Micro Volume 13, No. 3, pp. 48-60, June
1993. As applied in a RISC architecture, multiple register
sets normally utilized to support function calls are modified
to maintain multiple threads. Eight overlapping register
windows are modified to become four non-overlapping register
sets, wherein each register set is a reserve for trap and
message handling. This system discloses a thread switch which
occurs on each first level cache miss that results in a remote
memory request. While this system represents an advance in the
art, modern processor designs often utilize a multiple level
cache or high speed memory which is attached to the processor.
The processor system utilizes some well-known algorithm to
decide what portion of its main memory store will be loaded
within each level of cache and thus, each time a memory
reference occurs which is not present within the first level
of cache the processor must attempt to obtain that memory
reference from a second or higher level of cache.

It is thus an object of the invention to provide an
improved data processing system which can reduce delays due to
memory latency in a multilevel cache system utilizing hardware
logic and registers embodied in a multithread data processing
system.

The invention addresses this object by providing a
multithreaded processor capable of switching execution between
two threads of instructions, and thread switch logic embodied
in hardware registers with optional software override of thread
switch conditions. Processing various states of various
threads of instructions allows optimisation of the use of the
processor among the threads. Allowing the processor to
execute a second thread of instructions increases processor
utilisation which is otherwise idle when it is retrieving
necessary data and/or instructions from various memory
elements, such as caches, memories, external I/O, direct access
storage devices for a first thread. The conditions of thread
switching can be different per thread or can be changed during
processing by the use of a software thread control manager.

The invention provides a hardware thread switch control
register containing bits which can be enabled to embody the
events and cause a multithreaded processor to switch threads.
This hardware register has the further advantage of improving
processor performance because it is much faster than software
thread switch control.

Another aspect of the invention is a computer system
having a multithreaded processor capable of switching
processing between at least two threads of instructions when
the multithreaded processor experiences one of a plurality of
processor latency events. The computer system also has at
least one thread state register operatively connected to the
multithreaded processor, to store a state of the threads of
instructions wherein the state of each thread of instructions
changes when the processor switches processing to each thread.
The system also has at least one thread switch control register
operatively connected to the thread state register(s) and to
the multithreaded processor, to store a plurality of thread
switch control events which thread switch control events are
enabled by setting a corresponding plurality of enable bits.
The computer system further comprises a plurality of internal
connections connecting the multithreaded processor to a
plurality of memory elements. Access to any of the memory
elements by the multithreaded processor causes a processor
latency event and the invention also has at least one external
connection connecting the multithreaded processor to an
external memory device, a communication device, a computer
network, or an input/output device wherein access to any of the
devices or the network by the multithreaded processor also
causes a plurality of processor latency events. When one of
the threads executing in the multithreaded processor is unable
to continue execution because of one of the processor latency
events and when that processor latency event is a thread switch
control event whose bit is enabled, the multithreaded processor
switches execution to another of the threads.

The thread switch control register has a plurality of
bits, each associated uniquely with one of a plurality of
thread switch control events and if one of the bits is enabled,
the thread switch control event associated with that bit causes
the multithreaded processor(s) to switch from one thread of
instructions to another thread of instructions. The thread
switch control register is programmable. Moreover, the
enablement of a particular bit can be dynamically changed by
either operating software or by an instruction in one of the
threads.



The computer processing system may have more than one
thread switch control register wherein the bit values of one
thread switch control register differ from the bit values of
another of said thread switch control registers.

Typically, there can be many thread switch control events,
for instance, a data miss from at least one of the following:
a L1 data cache, a L2 cache, storage of data that crosses a
double word boundary, or an instruction miss from at least one
of the following: a L1 instruction cache, a translation
lookaside buffer, or a data and/or instruction miss from main
memory, or an error in address translation of data and/or an
instruction. Access to an I/O device external to the processor
or to another processor may also be thread switch control
events. Other thread switch control events comprise a forward
progress count of a number of times said one of a plurality of
threads has been switched from a one multithreaded processor
with no instruction of the one of a plurality of threads
executing, and a time-out period in which no useful work was
done by the at least one processor.
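
A thread switch control register with one enable bit per event
can be modelled in software as a bit mask. The sketch below is a
hypothetical illustration, not the register layout of the
invention: the event names paraphrase the list above, while the
bit positions, the class name ThreadSwitchControlRegister, and
its methods are assumptions.

```python
from enum import IntFlag


class SwitchEvent(IntFlag):
    # Bit positions are hypothetical; the text does not assign them.
    L1_DCACHE_MISS = 1 << 0
    L2_CACHE_MISS = 1 << 1
    DOUBLEWORD_CROSSING_STORE = 1 << 2
    L1_ICACHE_MISS = 1 << 3
    TLB_MISS = 1 << 4
    MAIN_MEMORY_MISS = 1 << 5
    TRANSLATION_ERROR = 1 << 6
    EXTERNAL_IO_ACCESS = 1 << 7
    FORWARD_PROGRESS_COUNT = 1 << 8
    TIMEOUT = 1 << 9


class ThreadSwitchControlRegister:
    """Software model of a register holding one enable bit per event."""

    def __init__(self, enabled=0):
        self.bits = int(enabled)

    def enable(self, event):
        self.bits |= int(event)

    def disable(self, event):
        self.bits &= ~int(event)

    def causes_switch(self, event):
        # An event triggers a thread switch only if its bit is set.
        return bool(self.bits & int(event))


# Two registers holding different bit values.
tsc_a = ThreadSwitchControlRegister(
    SwitchEvent.L2_CACHE_MISS | SwitchEvent.TLB_MISS)
tsc_b = ThreadSwitchControlRegister(SwitchEvent.L1_DCACHE_MISS)

# Enablement can be changed while running.
tsc_a.enable(SwitchEvent.TIMEOUT)
tsc_a.disable(SwitchEvent.TLB_MISS)
```

Keeping the decision in a mask means that checking whether an
event should cause a switch is a single AND operation, which is
in the spirit of the speed advantage claimed for a hardware
register over software thread switch control.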

The computer processing system of the invention comprises
means for processing a plurality of threads of instructions;
means for indicating when the processing means stalls because
one of the threads experiences a processor latency event; means
for registering a plurality of thread switch control events;
and means for determining if the processor latency event is one
of the plurality of thread switch control events. The
processing system may also comprise means for enabling the
processing means to switch processing to another thread if the
processor latency event is a thread switch control event.

The invention is also a method to determine the contents
of a thread switch control register, comprising the steps of
counting a first number of processor cycles in which a
multithreaded processor is stalled because of a processor
latency event and counting a second number of processor cycles
required for the multithreaded processor to switch processing
of a first thread of instructions to a second thread of
instructions, then assigning the processor latency event to be
a thread switch control event by setting an enable bit in the
thread switch control register if the first number is greater
than the second number. Then, if the enable bit is enabled, the
method comprises outputting a signal to switch threads when the
multithreaded processor experiences the thread switch control
event.
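
The comparison at the heart of this method can be written out
directly. The sketch below is a minimal illustration, assuming
the two cycle counts have already been measured; the function
names, the event names, and the example numbers are
hypothetical.

```python
def select_switch_events(stall_cycles_by_event, switch_cycles):
    """Decide which latency events become thread switch control events.

    stall_cycles_by_event maps an event name to the first number (the
    cycles the processor is stalled by that event); switch_cycles is the
    second number (the cycles needed to switch threads). An enable bit
    is set only when the stall is strictly longer than the switch.
    """
    return {event: stall > switch_cycles
            for event, stall in stall_cycles_by_event.items()}


def should_switch(enable_bits, event):
    """Signal a thread switch only for an event whose bit is enabled."""
    return enable_bits.get(event, False)


# Hypothetical measurements, in processor cycles.
measured = {"l1_dcache_miss": 5, "l2_cache_miss": 60, "tlb_miss": 30}
enable_bits = select_switch_events(measured, switch_cycles=7)
print(enable_bits)
```

With these assumed numbers, the short L1 miss is not worth a
switch because switching would cost more cycles than waiting,
while the longer L2 and TLB misses are enabled as thread switch
control events.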

The method of computer processing of the invention also
comprises the steps of storing a state of a thread in a thread
state register and storing a plurality of thread switch control
events in a thread switch control register. Then when the
state of the thread changes, a signal is output to the thread
state register and the changed state of the thread is compared
with the plurality of thread switch control events. If the
changed state results from a thread switch control event, a
signal is output to a multithreaded processor to switch
executing from the thread.

Other objects, features and characteristics of the present
invention; methods, operation, and functions of the related
elements of the structure; combination of parts; and economies
of manufacture will become apparent from the following detailed
description of the preferred embodiments and accompanying
drawings, all of which form a part of this specification,
wherein like reference numerals designate corresponding parts
in the various figures.



The invention itself, however, as well as a preferred mode
of use, further objectives and advantages thereof, will best
be understood by reference to the following detailed
description of an illustrative embodiment when read in
conjunction with the accompanying drawings, wherein:

Figure 1 is a block diagram of a computer system capable
of implementing the invention described herein.

Figure 2 illustrates a high level block diagram of a
multithreaded data processing system according to the present
invention.

Figure 3 illustrates a block diagram of the storage
control unit of Figure 2.

Figure 4 illustrates a block diagram of the thread switch
logic, the storage control unit and the instruction unit of
Figure 2.

Figure 5 illustrates the changes of state of a thread as
the thread experiences different thread switch events shown in
Figure 4.

Figure 6 is a flow chart of the forward progress count.



With reference now to the figures and in particular with
reference to Figure 1, there is depicted a high level block
diagram of a computer data processing system 10 which may be
utilized to implement the method and system of the present
invention. The primary hardware components and
interconnections of a computer data processing system 10
capable of utilising the present invention are shown in
Figure 1. Central processing unit (CPU) 100 for processing
instructions is coupled to caches 120, 130, and 150.
Instruction cache 150 stores instructions for execution by CPU
100. Data caches 120, 130 store data to be used by CPU 100.
The caches communicate with random access memory in main memory
140. CPU 100 and main memory 140 also communicate via bus
interface 152 with system bus 155. Various input/output
processors (IOPs) 160-168 attach to system bus 155 and support
communication with a variety of storage and input/output (I/O)
devices, such as direct access storage devices (DASD) 170, tape
drives 172, remote communication lines 174, workstations 176,
and printers 178. It should be understood that Figure 1 is
intended to depict representative components of a computer data
processing system 10 at a high level, and that the number and
types of such components may vary.

Within the CPU 100, a processor core 110 contains
specialized functional units, each of which performs primitive
operations, such as sequencing instructions, executing
operations involving integers, executing operations involving
real numbers, transferring values between addressable storage
and logical register arrays. Figure 2 illustrates a processor
core 100. In a preferred embodiment, the processor core 100
of the data processing system 10 is a single integrated
circuit, pipelined, superscalar microprocessor, which may be
implemented utilising any computer architecture such as the
family of RISC processors sold under the trade name PowerPC™;
for example, the PowerPC™ 604 microprocessor chip sold by IBM.
As will be discussed below, the data processing system 10
preferably includes various units, registers, buffers,
memories, and other sections which are all preferably formed
by integrated circuitry. It should be understood that in the
figures, the various data paths have been simplified; in
reality, there are many separate and parallel data paths into
and out of the various components. In addition, various
components not germane to the invention described herein have
been omitted, but it is to be understood that processors
contain additional units for additional functions. The data
processing system 10 can operate according to reduced
instruction set computing, RISC, techniques or other computing
techniques.

As represented in Figure 2, the processor core 100 of the
data processing system 10 preferably includes a level one data
cache, L1 D-cache 120, a level two L2 cache 130, a main memory
140, and a level one instruction cache, L1 I-cache 150, all of
which are operationally interconnected utilizing various bus
connections to a storage control unit 200. As shown in Figure
2, the storage control unit 200 includes a transition cache 210
for interconnecting the L1 D-cache 120 and the L2 cache 130,
the main memory 140, and a plurality of execution units. The
L1 D-cache 120 and L1 I-cache 150 preferably are provided on
chip as part of the processor 100 while the main memory 140 and
the L2 cache 130 are provided off chip. Memory system 140 is
intended to represent random access main memory which may or
may not be within the processor core 100, and other data
buffers and caches, if any, external to the processor core 100,
and other external memory, for example, DASD 170, tape drives
172, and workstations 176, shown in Figure 1. The L2 cache 130
is preferably a higher speed memory system than the main memory
140, and by storing selected data within the L2 cache 130, the
memory latency which occurs as a result of a reference to the
main memory 140 can be minimised. As shown in Figure 2, the
L2 cache 130 and the main memory 140 are directly connected to
both the L1 I-cache 150 and an instruction unit 220 via the
storage control unit 200.

Instructions from the L1 I-cache 150 are preferably output
to an instruction unit 220 which, in accordance with the method
and system of the present invention, controls the execution of
multiple threads by the various subprocessor units, e.g.,
branch unit 260, fixed point unit 270, storage control unit
200, and floating point unit 280 and others as specified by the
architecture of the data processing system 10. In addition to
the various execution units depicted within Figure 2, those
skilled in the art will appreciate that modern superscalar
microprocessor systems often include multiple versions of each
such execution unit which may be added without departing from
the spirit and scope of the present invention. Most of these
units will have as an input source operand information from
various registers such as general purpose registers GPRs 272,
and floating point registers FPRs 282. Additionally, multiple
special purpose registers SPRs 274 may be utilized. As shown
in Figure 3, the storage control unit 200 and the transition
cache 210 are directly connected to general purpose registers
272 and the floating point registers 282. The general purpose
registers 272 are connected to the special purpose registers
274.

Among the functional hardware units unique to this
multithreaded processor 100 are the thread switch logic 400 and
the transition cache 210. The thread switch logic 400 contains
various registers that determine which thread will be the
active or the executing thread. Thread switch logic 400 is
operationally connected to the storage control unit 200, the
execution units 260, 270, and 280, and the instruction unit
220. The transition cache 210 within the storage control unit
200 must be capable of implementing multithreading.
Preferably, the storage control unit 200 and the transition
cache 210 permit at least one outstanding data request per
thread. Thus, when a first thread is suspended in response to,
for example, the occurrence of an L1 D-cache miss, a second
thread would be able to access the L1 D-cache 120 for data
present therein. If the second thread also results in an L1
D-cache miss, another data request will be issued and thus
multiple data requests must be maintained within the storage
control unit 200 and the transition cache 210. Preferably,
transition cache 210 is the transition cache of U.S.
Application Serial Number 08/761,378 filed 09 December 1996
entitled Multi-Entry Fully Associative Transition Cache, hereby
incorporated by reference. The storage control unit 200, the
execution units 260, 270, and 280 and the instruction unit 220
are all operationally connected to the thread switch logic 400
which determines which thread to execute.

As illustrated in Figure 2, a bus 205 is provided between
the storage control unit 200 and the instruction unit 220 for
communication of, e.g., data requests to the storage control
unit 200, and a L2 cache 130 miss to the instruction unit 220.
Further, a translation lookaside buffer TLB 250 is provided
which contains virtual-to-real address mapping. Although not
illustrated within the present invention, various additional
high level memory mapping buffers may be provided such as a
segment lookaside buffer which will operate in a manner similar
to the translation lookaside buffer 250.

Figure 3 illustrates the storage control unit 200 in
greater detail, and, as the name implies, this unit controls
the input and output of data and instructions from the various
storage units, which include the various caches, buffers and
main memory. As shown in Figure 3, the storage control unit
200 includes the transition cache 210 functionally connected
to the L1 D-cache 120, multiplexer 360, the L2 cache 130, and
main memory 140. Furthermore, the transition cache 210
receives control signals from sequencers 350. The sequencers
350 include a plurality of sequencers, preferably three, for
handling instruction and/or data fetch requests. Sequencers
350 also output control signals to the transition cache 210,
the L2 cache 130, as well as receiving and transmitting control
signals to and from the main memory 140.

Multiplexer 360 in the storage control unit 200 shown in
Figure 3 receives data from the L1 D-cache 120, the transition
cache 210, the L2 cache 130, main memory 140, and, if data is
to be stored to memory, the execution units 270 and 280. Data
from one of these sources is selected by the multiplexer 360
and is output to the L1 D-cache 120 or the execution units in
response to a selection control signal received from the
sequencers 350. Furthermore, as shown in Figure 3, the
sequencers 350 output a selection signal to control a second
multiplexer 370. Based on this selection signal from the
sequencers 350, the multiplexer 370 outputs the data from the
L2 cache 130 or the main memory 140 to the L1 I-cache 150 or
the instruction unit 220. In producing the above-discussed
control and selection signals, the sequencers 350 access and
update the L1 directory 320 for the L1 D-cache 120 and the L2
directory 330 for the L2 cache 130.

With respect to the multithreading capability of the
processor described herein, sequencers 350 of the storage
control unit 200 also output signals to thread switch logic 400
which indicate the state of data and instruction requests. So,
feedback from the caches 120, 130 and 150, main memory 140, and
the translation lookaside buffer 250 is routed to the
sequencers 350 and is then communicated to thread switch logic
400 which may result in a thread switch, as discussed below.
Note that any device wherein an event designed to cause a
thread switch in a multithreaded processor occurs will be
operationally connected to sequencers 350.

Figure 4 is a logical representation and block diagram of
the thread switch logic hardware 400 that determines whether
a thread will be switched and, if so, what thread. Storage
control unit 200 and instruction unit 220 are interconnected
with thread switch logic 400. Thread switch logic 400
preferably is incorporated into the instruction unit 220 but
if there are many threads the complexity of the thread switch
logic 400 may increase so that the logic is external to the
instruction unit 220. For ease of explanation, thread switch
logic 400 is illustrated external to the instruction unit 220.

Some events which result in a thread to be switched in
this embodiment are communicated on lines 470, 472, 474, 476,
478, 480, 482, 484, and 486 from the sequencers 350 of the
storage control unit 200 to the thread switch logic 400. Other
latency events can cause thread switching; this list is not
intended to be inclusive; rather it is only representative of
how the thread switching can be implemented. A request for an
instruction by either the first thread T0 or the second thread
T1 which is not in the instruction unit 220 is an event which
can result in a thread switch, noted by 470 and 472 in Figure
4, respectively. Line 474 indicates when the active thread,
whether T0 or T1, experiences a L1 D-cache 120 miss. Cache
misses of the L2 cache 130 for either thread T0 or T1 are noted
at lines 476 and 478, respectively. Lines 480 and 482 are
activated when data is returned for continued execution of the
T0 thread or for the T1 thread, respectively. Translation
lookaside buffer misses and completion of a table walk are
indicated by lines 484 and 486, respectively.

These events are all fed into the thread switch logic 400
and more particularly to the thread state registers 440 and the
thread switch controller 450. Thread switch logic 400 has one
thread state register for each thread. In the embodiment
described herein, two threads are represented so there is a T0
state register 442 for a first thread T0 and a T1 state
register 444 for a second thread T1, to be described herein.

Thread switch logic 400 comprises a thread switch control
register 410 which controls what events will result in a thread
switch. For instance, the thread switch control register 410
can block events that cause state changes from being seen by
the thread switch controller 450 so that a thread may not be
switched as a result of a blocked event. The thread state
registers and the logic of changing threads are the subject of
U.S. application entitled Thread Switch Control in a
Multithreaded Processor System, Serial Number 08/957,002 filed
23 October 1997 concurrently herewith and herein incorporated
by reference. The forward progress count register 420 is used
to prevent thrashing and may be included in the thread switch
control register 410. The forward progress count register 420
is the subject of U.S. application entitled An Apparatus and
Method to Guarantee Forward Progress in a Multithreaded
Processor, Serial Number 08/956,875 filed 23 October 1997
concurrently herewith and herein incorporated by reference,
RO997-105. Thread switch time-out register 430, the subject
of U.S. application entitled Method and Apparatus to Force a
Thread Switch in a Multithreaded Processor, Serial Number
08/956,577 filed 23 October 1997 concurrently herewith and
herein incorporated by reference, RO997-107, allocates fairness
and livelock issues. Also, thread priorities can be altered
using software 460, the subject of U.S. application entitled
Altering Thread Priorities in a Multithreaded Processor, Serial
Number 08/958,718, filed 23 October 1997 concurrently herewith
and herein incorporated by reference, RO997-106. Finally, but
not to be limitative, the thread switch controller 450
comprises a myriad of logic gates which represents the
culmination of all logic which actually determines whether a
thread is switched, what thread, and under what circumstances.
Each of these logic components and their functions are set
forth in further detail.

Thread state registers 440 comprise a state register for
each thread and, as the name suggests, store the state of the
corresponding thread; in this case, a T0 thread state register
442 and a T1 thread state register 444. The number of bits and
the allocation of particular bits to describe the state of each
thread can be customized for a particular architecture and
thread switch priority scheme. An example of the allocation
of bits in the thread state registers 442, 444 for a
multithreaded processor having two threads is set forth in the
table below.



Thread State Register Bit Allocation

(0)      Instruction/Data
           0 = Instruction
           1 = Data
(1:2)    Miss type sequencer
           00 = None
           01 = Translation lookaside buffer miss (check bit 0 for I/D)
           10 = L1 cache miss
           11 = L2 cache miss
(3)      Transition
           0 = Transition to current state does not result in thread switch
           1 = Transition to current state results in thread switch
(4:7)    Reserved
(8)      0 = Load
         1 = Store
(9:14)   Reserved
(15:17)  Forward progress counter
           111 = Reset (instruction has completed during this thread)
           000 = 1st execution of this thread w/o instruction complete
           001 = 2nd execution of this thread w/o instruction complete
           010 = 3rd execution of this thread w/o instruction complete
           011 = 4th execution of this thread w/o instruction complete
           100 = 5th execution of this thread w/o instruction complete
(18:19)  Priority (could be set by software)
           00 = Medium
           01 = Low
           10 = High
           11 = <Illegal>
(20:31)  Reserved
(32:63)  Reserved if 64 bit implementation



In the embodiment described herein, bit 0 identifies
whether the miss or the reason the processor stalled execution
is a result of a request for an instruction or for data. Bits
1 and 2 indicate if the requested information was not available
and if so, from what hardware, i.e., whether the translated
address of the data or instruction was not in the translation
lookaside buffer 250, or the data or instruction itself was not
in the L1 D-cache 120 or the L2 cache 130, as further explained
in the description of Figure 5. Bit 3 indicates whether the
change of state of a thread results in a thread switch. A
thread may change state without resulting in a thread switch.
For instance, if a thread switch occurs when thread T1
experiences an L1 cache miss, then if thread T1 experiences a
L2 cache miss, there will be no thread switch because the
thread already switched on a L1 cache miss. The state of T1,
however, still changes. Alternatively, if by choice, the
thread switch logic 400 is configured or programmed not to
switch on a L1 cache miss, then when a thread does experience
an L1 cache miss, there will be no thread switch even though
the thread changes state. Bit 8 of the thread state registers
442 and 444 is assigned to whether the information requested
by a particular thread is to be loaded into the processor core
or stored from the processor core into cache or main memory.
Bits 15 through 17 are allocated to prevent thrashing, as
discussed later with reference to the forward progress count
register 420. Bits 18 and 19 can be set in the hardware or
could be set by software to indicate the priority of the
thread.
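
As an illustration only, the bit allocation described above can be decoded in software. The following Python sketch assumes a 32-bit register in which bit 0 is the most significant bit, and assumes that priority encoding 01 means low; the names field, decode and ThreadState are hypothetical and do not appear in the embodiment.

```python
from dataclasses import dataclass

WIDTH = 32  # assumed register width; bit 0 is taken to be the most significant bit

MISS_TYPES = {0b00: "none", 0b01: "TLB miss", 0b10: "L1 cache miss", 0b11: "L2 cache miss"}
# 00 = medium and 10 = high follow the text; 01 = low is an assumption.
PRIORITIES = {0b00: "medium", 0b01: "low", 0b10: "high"}


def field(value: int, first: int, last: int) -> int:
    """Extract bits first..last (inclusive), numbering bit 0 as the MSB."""
    length = last - first + 1
    shift = WIDTH - 1 - last
    return (value >> shift) & ((1 << length) - 1)


@dataclass
class ThreadState:
    is_data: bool               # bit 0: 0 = instruction, 1 = data
    miss_type: str              # bits 1:2
    switch_on_transition: bool  # bit 3
    is_store: bool              # bit 8: 0 = load, 1 = store
    forward_progress: int       # bits 15:17
    priority: str               # bits 18:19


def decode(value: int) -> ThreadState:
    """Split a raw thread state register value into its named fields."""
    return ThreadState(
        is_data=bool(field(value, 0, 0)),
        miss_type=MISS_TYPES[field(value, 1, 2)],
        switch_on_transition=bool(field(value, 3, 3)),
        is_store=bool(field(value, 8, 8)),
        forward_progress=field(value, 15, 17),
        priority=PRIORITIES.get(field(value, 18, 19), "illegal"),
    )
```

For example, a register value with bits 0 through 3 all set describes a data request that missed the L2 cache and whose transition caused a thread switch.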

Figure 5 represents four states in the present embodiment
of a thread processed by the data processing system 10 and
these states are stored in the thread state registers 440, bit
positions 1:2. State 00 represents the "ready" state, i.e.,
the thread is ready for processing because all data and
instructions required are available; state 10 represents the
thread state wherein the execution of the thread within the
processor is stalled because the thread is waiting for return
of data into either the L1 D-cache 120 or the return of an
instruction into the L1 I-cache 150; state 11 represents that
the thread is waiting for return of data into the L2 cache 130;
and the state 01 indicates that there is a miss on the
translation lookaside buffer 250, i.e., the virtual address was
in error or wasn't available, called a table walk. Also shown
in Figure 5 is the hierarchy of thread states wherein state 00,
which indicates the thread is ready for execution, has the
highest priority. Short latency events are preferably assigned
a higher priority.

Figure 5 also illustrates the change of states when data
is retrieved from various sources. The normal uninterrupted
execution of a thread T0 is represented in block 510 as state
00. If a L1 D-cache or I-cache miss occurs, the thread state
changes to state 10, as represented in block 512, pursuant to
a signal sent on line 474 (Figure 4) from the storage control
unit 200 or line 470 (Figure 4) from the instruction unit 220,
respectively. If the required data or instruction is in the
L2 cache 130 and is retrieved, then normal execution of T0
resumes at block 510. Similarly, block 514 of Figure 5
represents a L2 cache miss which changes the state of thread
of either T0 or T1 to state 11 when storage control unit 200
signals the miss on lines 476 or 478 (Figure 4). When the
instructions or data in the L2 cache are retrieved from main
memory 140 and loaded into the processor core 100 as indicated
on lines 480 and 482 (Figure 4), the state again changes back
to state 00 at block 510. The storage control unit 200
communicates to the thread registers 440 on line 484 (Figure
4) when the virtual address for requested information is not
available in the translation lookaside buffer 250, indicated
as block 516, as a TLB miss or state 01. When the address does
become available or if there is a data storage interrupt
instruction as signaled by the storage control unit 200 on line
486 (Figure 4), the state of the thread then returns to state
00, meaning ready for execution.
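
The state changes just described can be summarized as a small transition table. The Python sketch below is illustrative only: the event names are hypothetical labels for the signals discussed above, and it assumes that a L2 cache miss is reached from the L1 miss state.

```python
# Two-bit state encodings from bit positions 1:2 of the thread state register.
READY, TLB_MISS, L1_MISS, L2_MISS = 0b00, 0b01, 0b10, 0b11

# (current state, event) -> next state. Event names are hypothetical labels.
TRANSITIONS = {
    (READY, "l1_miss"): L1_MISS,           # L1 D-cache or I-cache miss
    (L1_MISS, "data_returned"): READY,     # request satisfied from the L2 cache
    (L1_MISS, "l2_miss"): L2_MISS,         # assumed: the L1 miss also misses the L2
    (L2_MISS, "data_returned"): READY,     # request satisfied from main memory
    (READY, "tlb_miss"): TLB_MISS,         # virtual address not in the TLB
    (TLB_MISS, "table_walk_done"): READY,  # address became available
}


def next_state(state: int, event: str) -> int:
    """Return the next thread state; events with no entry leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)
```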

The number of states, and what each state represents, is
freely selectable by the computer architect. For instance, if
a thread has multiple L1 cache misses, such as both a L1
I-cache miss and L1 D-cache miss, a separate state can be
assigned to each type of cache miss. Alternatively, a single
thread state could be assigned to represent more than one event
or occurrence. An example of a thread switch algorithm for
two threads of equal priority which determines whether to
switch threads is given. The algorithm can be expanded and
modified accordingly for more threads and thread switch
conditions according to the teachings of the invention. The
interactions between the state of each thread stored in the
thread state registers 440 (Figure 4) and the priority of each
thread by the thread switching algorithm are dynamically
interrogated each cycle. If the active thread T0 has a L1
miss, the algorithm will cause a thread switch to the dormant
thread T1 unless the dormant thread T1 is waiting for
resolution of a L2 miss. If a switch did not occur and the L1
cache miss of active thread T0 turns into a L2 cache miss, the
algorithm then directs the processor to switch to the dormant
thread T1 regardless of the T1's state. If both threads are
waiting for resolution of a L2 cache miss, the thread first
having the L2 miss being resolved becomes the active thread.
At every switch decision time, the action taken is optimized
for the most likely case, resulting in the best performance.
Note that thread switches resulting from a L2 cache miss are
conditional on the state of the other thread; if not, extra
thread switches would occur resulting in loss of performance.
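
A minimal sketch of the two-thread, equal-priority decision described above is given here in Python. The function names and string labels are hypothetical, and cases the passage does not address (for example, a translation lookaside buffer miss) are treated as not switching purely as an assumption.

```python
READY, TLB_MISS, L1_MISS, L2_MISS = "ready", "tlb_miss", "l1_miss", "l2_miss"


def should_switch(active_state: str, dormant_state: str) -> bool:
    """Decide whether to switch from the active thread to the dormant thread."""
    if active_state == L1_MISS:
        # Switch on an L1 miss unless the dormant thread waits on an L2 miss.
        return dormant_state != L2_MISS
    if active_state == L2_MISS:
        # An L2 miss switches regardless of the dormant thread's state.
        return True
    # Other cases are not covered by the passage; assume no switch.
    return False


def active_after_l2_resolution(current_active: str, resolved_thread: str,
                               states: dict) -> str:
    """If every thread was waiting on an L2 miss, the first one resolved runs."""
    if all(state == L2_MISS for state in states.values()):
        return resolved_thread
    return current_active
```

For instance, should_switch(L1_MISS, L2_MISS) is false, so the active thread keeps the processor until its own miss either resolves or turns into a L2 miss.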

In any multithreaded processor, there are latency and
performance penalties associated with switching threads. In
the multithreaded processor in the preferred embodiment
described herein, this latency includes the time required to
complete execution of the current thread to a point where it
can be interrupted and correctly restarted when it is next
invoked, the time required to switch the thread-specific
hardware facilities from the current thread's state to the new
thread's state, and the time required to restart the new thread
and begin its execution. Preferably the thread-specific
hardware facilities operable with the invention include the
thread state registers described above and the memory cells
described in U.S. Patent 5,778,243 entitled Multithreaded Cell
for a Memory, herein incorporated by reference. In order to
achieve optimal performance in a coarse grained multithreaded
data processing system, the latency of an event which generates
a thread switch must be greater than the performance cost
associated with switching threads in a multithreaded mode, as
opposed to the normal single-threaded mode.

The latency of an event used to generate a thread switch
is dependent upon both hardware and software. For example,
specific hardware considerations in a multithreaded processor
include the speed of external SRAMs used to implement an L2
cache external to the processor chip. Fast SRAMs in the L2
cache reduce the average latency of an L1 miss while slower
SRAMs increase the average latency of an L1 miss. Thus,
performance is gained if one thread switch event is defined as
a L1 cache miss in hardware having an external L2 cache data
access latency greater than the thread switch penalty. As an
example of how specific software code characteristics affect
the latency of thread switch events, consider the L2 cache
hit-to-miss ratio of the code, i.e., the number of times data
is actually available in the L2 cache compared to the number of
times data must be retrieved from main memory because data is
not in the L2 cache. A high L2 hit-to-miss ratio reduces the
average latency of an L1 cache miss because the L1 cache miss
seldom results in a longer latency L2 miss. A low L2 hit-to-
miss ratio increases the average latency of an L1 miss because
more L1 misses result in longer latency L2 misses. Thus, a L1
cache miss could be disabled as a thread switch event if the
executing code has a high L2 hit-to-miss ratio because the L2
cache data access latency is less than the thread switch
penalty. A L1 cache miss would be enabled as a thread switch
event when executing software code with a low L2 hit-to-miss
ratio because the L1 cache miss is likely to turn into a longer
latency L2 cache miss.
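
The trade-off above can be made concrete with a simple expected-latency calculation. The Python sketch below is an illustration under stated assumptions: it models the average L1 miss latency as a weighted mean of the L2 access latency and the main memory latency, and all function names are hypothetical. The cycle counts used with it are invented for the example and are not taken from the embodiment.

```python
def hit_fraction(hit_to_miss_ratio: float) -> float:
    """Convert an L2 hit-to-miss ratio (hits / misses) into a hit probability."""
    return hit_to_miss_ratio / (1.0 + hit_to_miss_ratio)


def average_l1_miss_latency(hit_to_miss_ratio: float,
                            l2_latency: float,
                            memory_latency: float) -> float:
    """Expected cycles to service an L1 miss (assumed weighted-mean model)."""
    h = hit_fraction(hit_to_miss_ratio)
    return h * l2_latency + (1.0 - h) * memory_latency


def enable_l1_miss_switch(hit_to_miss_ratio: float, l2_latency: float,
                          memory_latency: float, switch_penalty: float) -> bool:
    """Enable the L1 miss as a switch event only if waiting costs more than switching."""
    average = average_l1_miss_latency(hit_to_miss_ratio, l2_latency, memory_latency)
    return average > switch_penalty
```

With an assumed L2 latency of 5 cycles, memory latency of 60 cycles and switch penalty of 10 cycles, a hit-to-miss ratio of 19 leaves the L1 miss disabled as a switch event, while a ratio of 1 enables it.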

Some types of latency events are not readily detectable.
For instance, in some systems the L2 cache outputs a signal to
the instruction unit when a cache miss occurs. Other L2
caches, however, do not output such a signal, as in, for
example, if the L2 cache controller were on a separate chip
from the processor and accordingly, the processor cannot
readily determine a state change. In these architectures, the
processor can include a cycle counter for each outstanding L1
cache miss. If the miss data has not been returned from the
L2 cache after a predetermined number of cycles, the processor
acts as if there had been a L2 cache miss and changes the
thread's state accordingly. This algorithm is also applicable
to other cases where there is more than one distinct type of
latency. As an example only, for a L2 cache miss in a
multiprocessor, the latency of data from main memory may be
significantly different than the latency of data from another
processor. These two events may be assigned different states
in the thread state register. If no signal exists to
distinguish the states, a counter may be used to estimate which
state the thread should be in after it encounters a L2 cache
miss.

The thread switch control register 410 is a software
programmable register which selects the events to generate
thread switching and has a separate enable bit for each defined
thread switch control event. Although the embodiment
described herein does not implement a separate thread switch
control register 410 for each thread, separate thread switch
control registers 410 for each thread could be implemented to
provide more flexibility and performance at the cost of more
hardware and complexity. Moreover, the thread switch control
events in one thread switch control register need not be
identical to the thread switch control events in any other
thread switch control register.
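
The role of the per-event enable bits can be sketched as a simple mask. In the Python example below the bit positions and event names in EVENT_BITS are hypothetical and chosen only for illustration; a bit value of one is taken to enable the associated event and zero to disable it.

```python
# Hypothetical bit positions; not the register layout of the embodiment.
EVENT_BITS = {"l1_data_miss": 0, "l2_miss": 1, "tlb_miss": 2, "time_out": 3}


def enabled_events(control_register: int, pending_events: set) -> set:
    """Return the pending events whose enable bit is set in the control register."""
    return {
        event for event in pending_events
        if control_register & (1 << EVENT_BITS[event])
    }


def may_generate_switch(control_register: int, pending_events: set) -> bool:
    """A thread switch can be generated only by an enabled pending event."""
    return bool(enabled_events(control_register, pending_events))
```

With the control register set to 0b0010 in this sketch, a pending L1 data miss is masked off and only a L2 miss can generate a thread switch.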

The thread switch control register 410 can be written by
a service processor with software such as a dynamic scan
communications interface disclosed in U.S. Patent No. 5,079,725
entitled Chip Identification Method for Use with Scan Design
Systems and Scan Testing Techniques, or by the processor itself
with software system code. The contents of the thread switch
control register 410 are used by the thread switch controller
450 to enable or disable the generation of a thread switch.
A value of one in the register 410 enables the thread switch
control event associated with that bit to generate a thread
switch. A value of zero in the thread switch control register
410 disables the thread switch control event associated with
that bit from generating a thread switch. Of course, an
instruction in the executing thread could disable any or all
of the thread switch conditions for that particular or for
other threads. The following table shows the association

As discussed above, coarse grained multithreaded
processors rely on long latency events to trigger thread
switching. Sometimes during execution, a processor in a
multiprocessor environment or a background thread in a
multithreaded architecture has ownership of a resource that
can have only a single owner and another processor or active
thread requires access to the resource before it can make
forward progress. Examples include updating a memory page
table or obtaining a task from a task dispatcher. The
inability of the active thread to obtain ownership of the
resource does not result in a thread switch; nonetheless,
the thread is spinning in a loop unable to do useful work. In
this case, the background thread that holds the resource does
not obtain access to the processor so that it can free up the
resource because it never encountered a thread switch event and
does not become the active thread.

Allocating processing cycles among the threads is another
concern; if software code running on a thread seldom encounters
long latency switch events compared to software code running
on the other threads in the same processor, that thread will
get more than its fair share of processing cycles. Yet
another excessive delay that may exceed the maximum acceptable
time is the latency of an inactive thread waiting to service
an external interrupt within a limited period of time or some
other event external to the processor. Thus, it becomes
preferable to force a thread switch to the dormant thread after
some time if no useful processing is being accomplished to
prevent the system from hanging.

The logic to force a thread switch after a period of time
is a thread switch time-out register 430 (Figure 4), a
decrementer, and a decrement register to hold the decremented
value. The thread switch time-out register 430 holds a thread
switch time-out value. The thread switch time-out register 430
implementation used in this embodiment is shown in the
following table:

The embodiment of the invention described herein does not
implement a separate thread switch time-out register 430 for
each thread, although that could be done to provide more
flexibility. Similarly, if there are multiple threads, each
thread need not have the same thread switch time-out value.
Each time a thread switch occurs, the thread switch time-out
value from the thread switch time-out register 430 is loaded
by hardware into the decrement register. The decrement
register is decremented once each cycle until the decrement
register value equals zero, then a signal is sent to the thread
switch controller 450 which forces a thread switch unless no
other thread is ready to process instructions. For example,
if all other threads in the system are waiting on a cache miss
and are not ready to execute instructions, the thread switch
controller 450 does not force a thread switch. If no other
thread is ready to process instructions when the value in the
decrement register reaches zero, the decremented value is
frozen at zero until another thread is ready to process
instructions, at which point a thread switch occurs and the
decrement register is reloaded with a thread switch time-out
value for that thread. Similarly, the decrement register could
just as easily be named an increment register and when a thread
is executing, the register could increment up to some
predetermined value when a thread switch would be forced.
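
The time-out behavior described above can be modeled in a few lines. The Python sketch below is illustrative only: the class TimeoutDecrementer and its method names are hypothetical, and it assumes a single time-out value shared by all threads, with the counter holding at zero while no other thread is ready.

```python
class TimeoutDecrementer:
    """Models the decrement register loaded from the thread switch time-out value."""

    def __init__(self, timeout_value: int):
        self.timeout_value = timeout_value
        self.counter = timeout_value

    def on_thread_switch(self) -> None:
        """Reload the decrement register each time a thread switch occurs."""
        self.counter = self.timeout_value

    def tick(self, other_thread_ready: bool) -> bool:
        """Advance one cycle; return True if a thread switch is forced this cycle."""
        if self.counter > 0:
            self.counter -= 1
        if self.counter == 0 and other_thread_ready:
            self.on_thread_switch()
            return True
        # At zero with no ready thread, the value stays at zero.
        return False
```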

The thread switch time-out register 430 can be written by
a service processor as described above or by the processor
itself with software code. The thread switch time-out value
loaded into the thread switch time-out register 430 can be
customized according to specific hardware configuration and/or
specific software code to minimize wasted cycles resulting from
unnecessary thread switching. Too high of a value in the
thread switch time-out register 430 can result in reduced
performance when the active thread is waiting for a resource
held by another thread or if response latency for an external
interrupt or some other event external to the processor is too
long. Too high of a value can also prevent fairness if one
thread experiences a high number of thread switch events and
the other does not. A thread switch time-out value twice to
several times longer than the most frequent longest latency
event that causes a thread switch is recommended, e.g., access
to main memory. Forcing a thread switch after waiting the
number of cycles specified in the thread switch time-out
register 430 prevents system hangs due to shared resource
contention, enforces fairness of processor cycle allocation
between threads, and limits the maximum response latency to
external interrupts and other events external to the processor.

That at least one instruction must be executed each time
a thread switch occurs and a new thread becomes active is too
restrictive in certain circumstances, such as when a single
instruction generates multiple cache accesses and/or multiple
cache misses. For example, a fetch instruction may cause an
L1 I-cache 150 miss if the instruction requested is not in the
cache; but when the instruction returns, required data may not
be available in the L1 D-cache 120. Likewise, a miss in
translation lookaside buffer 250 can also result in a data
cache miss. So, if forward progress is strictly enforced,
misses on subsequent accesses do not result in thread switches.
A second problem is that some cache misses may require a large
number of cycles to complete, during which time another thread
may experience a cache miss at the same cache level which can
be completed in much less time. If, when returning to the
first thread, the strict forward progress is enforced, the
processor is unable to switch to the thread with the shorter
cache miss. To remedy the problem of thrashing wherein each
thread is locked in a repetitive cycle of switching threads
without any instructions executing, there exists a forward
progress count register 420 (Figure 4) which allows up to a
programmable maximum number of thread switches called the
forward progress threshold value. After that maximum number
of thread switches, an instruction must be completed before
switching can occur again. In this way, thrashing is
prevented. Forward progress count register 420 may actually
be bits 30:31 in the thread switch control register 410 or a
software programmable forward progress threshold register for
the processor. The forward progress count logic uses bits
15:17 of the thread state registers 442, 444 that indicate the
state of the threads and are allocated for the number of thread
switches a thread has experienced without an instruction
executing. Preferably, then these bits comprise the forward
progress counter.
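
The forward progress threshold described above can be sketched as a per-thread counter. The Python example below is a simplified model under stated assumptions: the class ForwardProgress and its method names are hypothetical, and a separate flag stands in for the reset encoding of the counter bits.

```python
class ForwardProgress:
    """Per-thread count of switches made without completing an instruction."""

    def __init__(self, threshold: int):
        self.threshold = threshold  # forward progress threshold value
        self.count = 0              # switches away without a completed instruction
        self.completed = False      # instruction completed since last activation

    def thread_activated(self) -> None:
        """Called when this thread becomes the active thread."""
        self.completed = False

    def instruction_completed(self) -> None:
        """Forward progress was made, so the counter is reset."""
        self.completed = True
        self.count = 0

    def may_switch(self) -> bool:
        """Switching is blocked once the threshold is reached with no progress."""
        return self.completed or self.count < self.threshold

    def record_switch(self) -> None:
        """Count a switch away from this thread that made no forward progress."""
        if not self.completed:
            self.count += 1
```

With a threshold of zero in this model, may_switch stays false until instruction_completed has been called, so at least one instruction must complete before the thread can be switched out.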

When a thread changes state invoking the thread switch
algorithm, if at least one instruction has completed in the
active thread, the forward-progress counter for the active
thread is reset and the thread switch algorithm continues to
compare thread states between the threads in the processor.
If no instruction has completed, the forward-progress counter
value in the thread state register of the active thread is
compared to the forward progress threshold value. If the
counter value is not equal to the threshold value, the thread
switch algorithm continues to evaluate the thread states of the
threads in the processor. Then if a thread switch occurs, the
forward-progress counter is incremented. If, however, the
counter value is equal to the threshold value, no thread switch
will occur until an instruction can execute, i.e., until
forward progress occurs. Note that if the threshold register
has value zero, at least one instruction must complete within
the active thread before switching to another thread. If each
thread switch requires three processor cycles and if there are
two threads and if the thread switch logic is programmed to
stop trying to switch threads after five tries, then the
maximum number of cycles that the processor will thrash is
thirty cycles (3 cycles x 2 threads x 5 tries). One of skill
in the art can appreciate that a potential conflict exists
between prohibiting a thread switch because no forward progress
will be made on one hand and, on the other hand, forcing a
thread switch because the time-out count has been exceeded.
Such a conflict can easily be resolved according to
architecture and software. Figure 6 is a flowchart of the
forward progress count feature of thread switch logic 400 which
prevents thrashing. At block 610, bits 15:17 in thread state
register 442 pertaining to thread T0 are reset to state 111.
Execution of this thread is attempted at block 620 and the
state changes to 000. If an instruction successfully executes
on thread T0, the state of thread T0 returns to 111 and remains
so. If, however, thread

forward progress logic improves processor performance. One way
to handle these extremely long latency events is to block the
incrementing of the forward progress counter or ignore the
output signal of the comparison between the forward progress
counter and the threshold value if data has not returned.
Another way to handle extremely long latency events is to use
a separate larger forward progress count for these particular
events.



The thread state for all software threads dispatched to
the processor is preferably maintained in the thread state
registers 442 and 444 of Figure 4 as described. In a single
processor one thread executes its instructions at a time and
all other threads are dormant. Execution is switched from the
active thread to a dormant thread when the active thread
encounters a long-latency event as discussed above with respect
to the forward progress register 420, the thread switch control
register 410, or the thread switch time-out register 430.
Independent of which thread is active, these hardware registers
use conditions that do not dynamically change during the course
of execution.

Flexibility to change thread switch conditions by a thread
switch manager improves overall system performance. A software
thread switch manager can alter the frequency of thread
switching, increase execution cycles available for a critical
task, and decrease the overall cycles lost because of thread
switch latency. The thread switch manager can be programmed
either at compile time or during execution by the operating
system, e.g., a locking loop can change the frequency of thread
switches; or an operating system task can be dispatched because
a dormant thread in a lower priority state is waiting for an
external interrupt or is otherwise ready. It may be
advantageous to disallow or decrease the frequency of thread
switches away from an active thread so that performance of the
current instruction stream does not suffer the latencies
resulting from switching into and out of it. Alternatively,
a thread can forgo some or all of its execution cycles by
essentially lowering its priority, and as a result, decrease
the frequency of switches into it or increase the frequency of
switches out of the thread to enhance overall system
performance. The thread switch manager may also
unconditionally force or inhibit a thread switch, or influence
which thread is next selected for execution.

A multiple-priority thread switching scheme assigns a
priority value to each thread to qualify the conditions that
cause a switch. It may also be desirable in some cases to have
the hardware alter thread priority. For instance, a
low-priority thread may be waiting on some event, which when it
occurs, the hardware can raise the priority of the thread to
influence the response time of the thread to the event.
Relative priorities between threads or the priority of a
certain thread will influence the handling of such an event.
The priorities of the threads can be adjusted by the thread
switch manager software through the use of one or more
instructions, or by hardware in response to an event. The
thread switch manager alters the actions performed by the
hardware thread switch logic to effectively change the relative
priority of the threads.

Three priorities are used with the embodiment described
herein of two threads and provide sufficient distinction
between threads to allow tuning of performance without
adversely affecting system performance. With three priorities,
two threads can have an equal status of medium priority. The
choice of three priorities for two threads is not intended to
be limiting. In some architectures a "normal" state may be
that one thread always has a higher priority than the other
threads. It is intended to be within the scope of the
invention to cover more than two threads of execution having
one or multiple priorities that can be set in hardware or
programmed by software.

The three priorities of each thread are high, medium, and
low. When the priority of thread T0 is the same as thread T1,
there is no effect on the thread switching logic. Both threads
have equal priority so neither is given an execution time
advantage. When the priority of thread T0 is greater than the
priority of thread T1, thread switching from T0 to T1 is
disabled for all L1 cache misses, i.e., data load, data store,
and instruction fetch, because L1 cache misses are resolved
much faster than other conditions such as L2 misses and
translates. Thread T0 is given a better chance of receiving
more execution cycles than thread T1, which allows thread T0 to
continue execution so long as it does not waste an excessive
number of execution cycles. The processor, however, will still
relinquish control to thread T1 if thread T0 experiences a
relatively long execution latency. Thread switching from T1
to T0 is unaffected, except that a switch occurs when dormant
thread T0 is ready, in which case thread T0 preempts thread T1.
This case would be expected to occur when thread T0 switches
away because of an L2 cache miss or translation request, and
the condition is resolved in the background while thread T1 is
executing. The case of thread T0 having a priority less than
thread T1 is analogous to the case above, with the thread
designation reversed.
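The priority rules in the preceding paragraph can be sketched as
a small decision function. This is only an illustrative model
under stated assumptions: the event names, the `Priority` type,
and the `should_switch` function are hypothetical and do not
appear in the embodiment, and the set `tsc_enabled` stands in
for the conditions enabled in the thread switch control register.

```python
from enum import IntEnum


class Priority(IntEnum):
    LOW = 0
    MEDIUM = 1
    HIGH = 2


# Hypothetical event names for the three L1 cache miss types in the text.
L1_MISSES = {"l1_data_load", "l1_data_store", "l1_instruction_fetch"}


def should_switch(active, dormant, event, tsc_enabled, dormant_ready=False):
    """Decide whether to switch away from the active thread.

    active, dormant -- Priority of the active and the dormant thread
    event           -- latency event seen by the active thread, or None
    tsc_enabled     -- set of events enabled as thread switch conditions
    dormant_ready   -- True once the dormant thread's miss is resolved
    """
    if active < dormant and dormant_ready:
        # A ready higher-priority dormant thread preempts the active one.
        return True
    if event is None or event not in tsc_enabled:
        return False
    if active > dormant and event in L1_MISSES:
        # L1 misses resolve quickly, so the higher-priority thread stays.
        return False
    return True
```

With equal priorities the function falls through to the plain
thread switch control conditions, matching the statement that
equal priority has no effect on the thread switching logic.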

There are different possible approaches to implementing
management of thread switching by changing thread priority.
New instructions can be added to the processor architecture.
Existing processor instructions having side effects that have
the desired actions can also be used. Several factors
influence the choice among the methods of allowing software
control: (a) the ease of redefining architecture to include
new instructions and the effect of architecture changes on
existing processors; (b) the desirability of running identical
software on different versions of processors; (c) the
performance tradeoffs between using new, special purpose
instructions versus reusing existing instructions and defining
resultant side effects; (d) the desired level of control by the
software, e.g., whether the effect can be caused by every
execution of some existing instruction, such as a specific load
or store, or whether more control is needed, by adding an
instruction to the stream to specifically cause the effect.

The architecture described herein preferably takes
advantage of an unused instruction whose values do not change
the architected general purpose registers of the processor;
this feature is critical for retrofitting multithreading
capabilities into a processor architecture. Otherwise special
instructions can be coded. The instruction is a "preferred
nop" or 0,0,0; other instructions, however, can effectively act
as a nop. By using different versions of the or instruction,
or 0,0,0 or 1,1,1 etc., to alter thread priority, the same
instruction stream may execute on a processor without adverse
effects such as illegal instruction interrupts. An extension
uses the state of the machine state register to alter the
meaning of these instructions. For example, it may be
undesirable to allow a user to code some or all of these thread
priority instructions and access the functions they provide.
The special functions they provide may be defined to occur only
in certain modes of execution; they will have no effect in
other modes and will be executed normally, as a nop.

One possible implementation, using a dual-thread
multithreaded processor, uses three instructions which become
part of the executing software itself to change the priority
of itself:

NOTE: Only valid in privileged mode

Instructions tsop 1 and tsop 2 can be the same instruction
as embodied herein as or 1,1,1 but they can also be separate
instructions. These instructions interact with bits 19 and 21
of the thread switch control register 410 and the
problem/privilege bit of the machine state register as
described herein. If bit 21 of the thread switch control
register 410 has a value of one, the thread switch manager can
set the priority of its thread to one of three priorities
represented in the thread state register at bits 18:19. If bit
19 of the thread switch control register 410 has a value zero,
then the instruction tsop 2 thread switch and thread priority
setting is controlled by the problem/privilege bit of the
machine state register. On the other hand, if bit 19 of the
thread switch control register 410 has a value one, or if the
problem/privilege bit of the machine state register has a value
zero and the instruction or 1,1,1 is present in the code, the
priority for the active thread is set to low and execution is
immediately switched to the dormant or background thread if the
dormant thread is enabled. The instruction or 2,2,2 sets the
priority of the active thread to medium regardless of the value
of the problem/privilege bit of the machine state register.
And the instruction or 3,3,3, when the problem/privilege bit
of the machine state register has a value of zero, sets the
priority of the active thread to high. If bit 21 of the thread
switch control register 410 is zero, the priority for both
threads is set to medium and the effect of the or x,x,x
instructions on the priority is blocked. If an external
interrupt request is active, and if the corresponding thread's
priority is low, that thread's priority is set to medium.
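The interaction just described between the or n,n,n
instructions, the two thread switch control register bits, and
the problem/privilege bit can be summarized in a short sketch.
It is a reading of the paragraph above, not a definitive
implementation: the function and parameter names are
hypothetical, `priority_enable` stands for bit 21, and
`unprivileged_ok` stands for the other thread switch control
register bit that the paragraph pairs with bit 21. The sketch
also assumes that no thread switch occurs when priority changes
are blocked, which the text does not state.

```python
LOW, MEDIUM, HIGH = "low", "medium", "high"


def apply_or_instruction(n, priority, unprivileged_ok, priority_enable,
                         msr_pr, dormant_enabled=True):
    """Return (new_priority, switch_now) for the instruction `or n,n,n`.

    n               -- register number used in all three operands
    priority        -- current priority of the active thread
    unprivileged_ok -- control bit that lets `or 1,1,1` act in any mode
    priority_enable -- bit 21 of the thread switch control register
    msr_pr          -- problem/privilege bit of the machine state
                       register (0 is taken to mean privileged mode)
    dormant_enabled -- whether the dormant thread may be switched to
    """
    if priority_enable == 0:
        # Priority changes are blocked and both threads sit at medium.
        # Assumption: no thread switch is triggered in this case.
        return MEDIUM, False
    if n == 1 and (unprivileged_ok == 1 or msr_pr == 0):
        return LOW, dormant_enabled
    if n == 2:
        return MEDIUM, False
    if n == 3 and msr_pr == 0:
        return HIGH, False
    # In every other case the instruction executes as a plain nop.
    return priority, False
```

For example, `apply_or_instruction(3, LOW, 0, 1, 1)` leaves the
priority unchanged, modeling an or 3,3,3 issued in problem state.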

The events altered by the thread priorities are: (1)
switch on L1 D-cache miss to load data; (2) switch on L1
D-cache miss for storing data; (3) switch on L1 I-cache miss on
an instruction fetch; and (4) switch if the dormant thread is
in ready state. In addition, external interrupt activation may
alter the corresponding thread's priority. The following table
shows the effect of priority on conditions that cause a thread
switch. A simple TSC entry in columns three and four means to
use the conditions set forth in the thread switch control (TSC)
register 410 to initiate a thread switch. An entry of TSC[0:2]
treated as 0 means that bits 0:2 of the thread switch control
register 410 are treated as if the value of those bits were
zero for that thread and the other bits in the thread switch
control register 410 are used as is for defining the conditions
that cause thread switches. The phrase when thread T0 ready in
column four means that a switch to thread T0 occurs as soon as
thread T0 is no longer waiting on the miss event that caused
it to be switched out. The phrase when thread T1 ready in
column three means that a switch to thread T1 occurs as soon as
thread T1 is no longer waiting on the miss event that caused
it to be switched out. If the miss event is a thread switch
time-out, there is no guarantee that the lower priority thread
completes an instruction before the higher priority thread
switches back in.

It is recommended that a thread doing no productive work
be given low priority to avoid a loss in performance even if
every instruction in the idle loop causes a thread switch.
Yet, it is still important to allow hardware to alter thread
priority if an external interrupt is requested to a thread set
at low priority. In this case the thread is raised to medium
priority, to allow a quicker response to the interrupt. This
allows a thread waiting on an external event to set itself at
low priority, where it will stay until the event is signalled.
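A minimal sketch of this idle-thread behavior follows, assuming
a hypothetical `ThreadPriority` class whose names are invented
for illustration: software lowers the priority of an idle
thread, and the hardware response to an external interrupt is
modeled as raising a low-priority thread to medium.

```python
class ThreadPriority:
    """Hypothetical model of one thread's priority field."""

    def __init__(self):
        self.level = "medium"

    def enter_idle_loop(self):
        # Software gives an idle thread low priority.
        self.level = "low"

    def external_interrupt(self):
        # Hardware raises a low-priority thread to medium so it can
        # respond to the interrupt sooner; other levels are unchanged.
        if self.level == "low":
            self.level = "medium"
```

A thread at high priority is left untouched by
`external_interrupt`, since only the low-to-medium case is
described in the text.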

While the invention has been described in connection with
what is presently considered the most practical and preferred
embodiments, it is to be understood that the invention is not
limited to the disclosed embodiments, but on the contrary, is
intended to cover various modifications and equivalent
arrangements included within the spirit and scope of the
appended claims.

Claims

1. A computer processor comprising:

at least one multithreaded processor (100) to switch
execution between a plurality of threads of
instructions; and

at least one thread switch control register (410) having
a plurality of bits, each of said bits associated
uniquely with one of a plurality of thread switch
control events, the at least one thread switch
control register interconnected with the
multithreaded processor.

2. The processor of Claim 1 wherein if one of the bits is
enabled, the thread switch control event associated with that
bit causes the at least one multithreaded processor to switch
from one of a plurality of threads to another of said plurality
of threads.

3. The processor of Claim 1 or 2 wherein the thread switch
control register is programmable.

4. The processor of one of Claims 1 to 3 wherein at least one
instruction can disable at least one of the bits in the thread
switch control register.

5. The processor of one of Claims 1 to 4 comprising more than
one thread switch control register.

6. The processor of Claim 5 wherein the bit values of one
thread switch control register differs from the bit values of
another of said thread switch control registers.

7. The processor of one of Claims 1 to 6 wherein the
plurality of thread switch control events comprise a data miss
from at least one of the following: a L1-data cache, a L2
cache, a translation lookaside buffer.

8. The processor of one of Claims 1 to 7 wherein the
plurality of thread switch control events comprise an
instruction miss from at least one of the following: a
L1-instruction cache, a translation lookaside buffer.

9. The processor of one of Claims 1 to 8 wherein the
plurality of thread switch control events comprise an error in
address translation of data and/or an instruction.

10. The processor of one of Claims 1 to 9 wherein the
plurality of thread switch control events comprise access to
an I/O device external to said processor.

11. The processor of one of Claims 1 to 10 wherein the
plurality of thread switch control events comprise access to
another processor.

12. The processor of one of Claims 2 to 11 wherein the
plurality of thread switch control events comprise a forward
progress count of a number of times the one of a plurality of
threads has been switched from the at least one multithreaded
processor with no instruction of the one of a plurality of
threads executing.

13. The processor of one of Claims 1 to 12 wherein the
plurality of thread switch control events comprise a time-out
period.

14. A computer processing system comprising:

means for processing a plurality of threads of
instructions;

means for indicating when the processing means stalls
because one of the plurality of threads experiences
a processor latency event;

means for registering a plurality of thread switch
control events; and

means for determining if the processor latency event is
one of the plurality of thread switch control
events.

15. The computer processing system of Claim 14, further
comprising:

means for enabling the processing means to switch
processing to another of the plurality of threads if
the processor latency event is one of the plurality
of thread switch control events.

16. A method to determine content of a thread switch control
register, comprising the steps of:

counting a first number of processor cycles in which a
multithreaded processor is stalled because of a
processor latency event;

counting a second number of processor cycles required for
the multithreaded processor to switch processing of
a first thread of instructions to a second thread of
instructions;

assigning the processor latency event to be a thread
switch control event by setting an enable bit in the
thread switch control register if the first number
is greater than the second number.

17. The method of Claim 16, further comprising:

outputting a signal to switch threads when the
multithreaded processor experiences the thread
switch control event if the enable bit is enabled.



18. A method of computer processing comprising the steps of:

storing a state of a thread in a thread state register;

storing a plurality of thread switch control events in a
thread switch control register;

signaling the thread state register when the state of the
thread changes;

comparing the changed state of the thread with the
plurality of thread switch control events.

19. The method of Claim 18, further comprising:

signalling a multithreaded processor to switch execution
from the thread if the changed state results from a
thread switch control event.

20. A computer system, comprising:

a multithreaded processor (100) capable of switching
processing between at least two threads of
instructions when the multithreaded processor
experiences one of a plurality of processor latency
events;

at least one thread state register (440) operatively
connected to the multithreaded processor to store a
state of the threads of instructions wherein the
state of each thread of instructions changes when
the processor switches processing to each thread;

at least one thread switch control register (410)
operatively connected to the at least one thread
state register and to the multithreaded processor,
to store a plurality of thread switch control events
which thread switch control events are enabled by
setting a corresponding plurality of enable bits;

a plurality of internal connections connecting the
multithreaded processor to a plurality of memory
elements (120, 130, 140, 150) wherein access to any
of the plurality of memory elements by the
multithreaded processor causes a processor latency
event;

wherein when one of the threads executing in the
multithreaded processor is unable to continue execution
because of one of the processor latency events and when
that processor latency event is a thread switch control
event and when that corresponding enable bit is set, the
multithreaded processor switches execution to another of
the threads.

21. The computer system of Claim 20 further comprising at
least one external connection QSs)
multithreaded processor to at least one external i
(170), at least one external communication device (174), an
external computer network (374), or at least one input/output
device iiw; wherein access to any of the devices or the
network by the multithreaded processor causes one of the
plurality of processor latency events.
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